Adjust ndistinct for eqjoinsel

  • zlyu@vmware.com 2022-07-15T14:06:54+00:00
    Hi,

    I ran the TPC-DS benchmark on Postgres and found several problems with join size estimation. For example, ndistinct is key to join selectivity estimation, but this value does not take the rel's restrictions into account: I hit some cases in the function eqjoinsel where nd is much larger than vardata.rel->rows. Accurate estimation needs a good mathematical model that considers the dependency between the join var and the vars in the restrictions. But at the very least, ndistinct should not be greater than the number of rows. See the attached patch, which adjusts nd in eqjoinsel.

    Best,
    Zhenghua Lyu
    • tgl@sss.pgh.pa.us 2022-07-15T15:56:41+00:00
      Zhenghua Lyu <zlyu@vmware.com> writes:
      > I run TPC-DS benchmark for Postgres and find the join size estimation has several problems.
      > For example, Ndistinct is key to join selectivity's estimation, this value does not take restrictions
      > of the rel, I hit some cases in the function eqjoinsel, nd is much larger than vardata.rel->rows.
      > Accurate estimation need good math model that considering dependency of join var and vars in restriction.
      > But at least, indistinct should not be greater than the number of rows.
      > See the attached patch to adjust nd in eqjoinsel.

      We're very unlikely to accept this with no test case and no explanation of why it's not an overcorrection. get_variable_numdistinct already clamps its result to rel->tuples, and I think that by using rel->rows instead you are probably double-counting the selectivity of the rel's restriction clauses. See the sad history of commit 7f3eba30c, which did something pretty close to this and eventually got almost entirely reverted (97930cf57, 0d3b231ee).

      I'd be the first to agree that better estimates here would be great, but it's not as simple as it looks.

      regards, tom lane